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Abstract 

We  discuss  two  physical  systems  from  separate  disci¬ 
plines  that  make  use  of  the  same  algorithmic  and  math¬ 
ematical  structures  to  reduce  the  number  of  operations 
necessary  to  complete  a  realistic  simulation.  In  the  grav¬ 
itational  N-body  problem,  the  acceleration  of  an  object 
is  given  by  the  familiar  Netwonian  laws  of  motion  and 
gravitation.  The  computational  load  is  reduced  by  treating 
groups  of  bodies  as  single  multipole  sources  rather  than  in¬ 
dividual  bodies.  In  the  simulation  of  incompressible  flows, 
the  flow  may  be  modeled  by  the  dynamics  of  a  set  of  iVT 
interacting  vortices.  Vortices  are  vector  objects  in  three  di¬ 
mensions,  but  their  interactions  are  mathematically  similar 
to  that  of  gravitating  masses.  The  multipole  approximation 
can  be  used  to  greatly  reduce  the  time  needed  to  compute 
the  interactions  between  vortices. 

Both  types  of  simulations  were  carried  out  on  the  Intel 
Touchstone  Delta,  a  parallel  MIMD  computer  with  512 
processors.  Timings  are  reported  for  systems  of  up  to  10 
million  bodies,  and  demonstrate  that  the  implementation 
scales  well  on  massively  parallel  systems.  The  majority  of 
the  code  is  common  between  the  two  applications,  which 
differ  only  in  certain  “physics”  modules.  In  particular,  the 
code  for  parallel  tree  construction  and  traversal  is  shared. 

1  Introduction 

Tree-based  algorithms  have  had  a  major  impact  on  the 
study  of  the  evolution  of  gravitating  systems  because  they 
provide  a  method  of  computing  the  mutual  interactions 

'Submitted  to  the  International  Journal  of  Supercomputer  Applications, 
Aug  6,  1993 

^Department  of  Physics,  University  of  California,  Santa  Barbara 
’new  address:  Dept,  of  Mechanical  Eng.,  Univ.  of  Sherbrooke,  2300 
Boulevard  University,  Sherbrooke  JIK  2R 1,  Qudbec,  Canada 


of  N  bodies  in  much  less  than  0{N^)  time.  TVee  codes 
have  been  reported  to  scale  as  0{N)  or  0(JVlog  JV),  but 
the  “big-0"  notation  can  be  misleading  for  practical  val¬ 
ues  of  N.  where  true  performance  is  dominated  by  the 
constants  fliat  ate  discarded  by  the  asymptotic  analysis. 
Large-scale  application  of  tree-based  approximation  meth¬ 
ods  has  (to  our  knowledge)  only  occurred  in  astrq>hysics, 
(41, 16, 40, 24]  although  prelimiruuy  work  has  been  done 
on  two-dimensional  [33,31,21, 19],  and  three-dimensional 
systems  [22, 39, 8, 15]  from  other  disciplines. 

One  roadblock  to  the  widespread  acceptance  of  tree 
codes  is  that  they  are  inherently  complex  to  program,  es¬ 
pecially  for  parallel  machines.  We  report  on  an  implemen¬ 
tation  of  a  tree  code  which  is  not  specific  to  a  particular 
problem  domain.  Although  designed  with  astrophysical 
research  firmly  in  mind,  the  code  described  here  is  used 
to  address  example  problems  in  vortex  dynamics  as  well 
as  astrophysics.  We  report  on  its  performance  on  the  In¬ 
tel  Touchstone  Delta  system  wifli  up  to  512  distributed- 
memory  processors. 

Two  outcomes  are  possible  when  algorithmic  advances 
drastically  reduce  the  time  and  space  required  to  solve  a 
class  of  problems.  The  first  is  that  the  problems  cease  to  be 
“supercomputer  applications”,  and  fall  into  the  domain  of 
workstations  and  perscmal  computers.  The  second  is  that 
practitioners  gradually  advance  the  state-of-the-art  in  die 
underlying  discipline,  and  much  larger  problems  become 
the  norm.  In  the  latter  case,  supercomputers  remain  a 
critical  component,  and  it  is  important  to  address  issues  of 
whether  the  new  algorithm  is  well-suited  to  supercomputer 
architectures,  i.e.,  massively  parallel  systems.  The  second 
outcome  has  certainly  been  the  case  in  astrophysics,  where 
state-of-the-art  simulations  now  evolve  systems  of  lO’-lO* 
bodies.  We  expect  it  will  occur  in  other  fields  as  tree-based 
methods  become  generally  available. 

2  The  Gravitational  N-body  Problem 
2.1  Mathematics 

While  it  is  possible  to  describe  tree  methods  in  terms 
of  broad  generality  and  abstraction,  we  find  it  helpful  to 
begin  witli  a  concrete  example  and  to  develop  the  abstrac¬ 
tion  in  stages.  (Not  coincidentally,  this  is  also  how  our 
understanding  of  the  problem  evolved).  Newton  himself 
would  find  the  underlying  mathematics  of  the  gravitational 
N-body  problem  quite  (Lilian  In  modem  notation,  New¬ 
ton’s  law  of  gi  avitation,  and  his  second  law  of  motion  are: 

^x'’(f)  =  -V</>(xP(t),f)  i,j  =  (\) 
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We  have  chosen  units  in  which  the  gravitational  constant, 
Cr  =  1.  When  evaluated  at  the  position  of  one  of  the  parti¬ 
cles,  x^{t),  the  expression  in  Eq.  2  is  clearly  singular.  The 
summation  must  be  understood  not  to  include  the  “self- 
interaction”  of  a  mass-point  with  itself.  Alternatively,  it 
is  possible  to  artificially  “smootii”  the  Netwonian  inter¬ 
action,  so  that  the  gravitational  potential  between  nearby 
bodies  is  bounded.  This  procedure  also  has  the  side-effect 
of  removing  “collisions”  which  is  often  desirable  from  a 
numerical  orphysical  point  of  view  [23, 17].  Wecandefine 
a  smootiied  Green’s  function,  Ga{x)  =  and  a 

smoothed  potential: 

1 

Many  choices  are  available  for  the  smoothed  Green’s  func- 
tiotL  For  the  tests  reported  here,  we  use  the  Plummer 
smoothing:  G{p)  = 

Equations  1  and  2  or  3  constitute  a  system  of  second- 
order  ordinary  differential  equations.  As  such,  it  is  not 
particularly  difficult  to  integrate  in  time  numerically.  The 
computationally  challenging  part  is  the  evaluation  of  N 
right-hand-sides,  each  of  which  is  a  sum  of  iV  -  1  terms. 
The  gravitational  force-law  Is  long-range,  i.e.,  the  force- 
law  falls  off  slowly  enough  that  contributions  from  distant 
objects  cannot  be  assumed  to  vanish.  Thus,  it  is  not  per¬ 
missible  to  simply  disregard  contributions  from  bodies  that 
more  distant  than  some  prescribed  cutoff. 

We  can  turn  again  to  Newttxi,  at  least  for  the  first  term 
in  the  solution.  Let  us  imagine  that  the  point  x  is  well- 
separated  (in  a  sense  defined  below)  from  a  spatially  local¬ 
ized  subset  of  the  rest  of  the  points,  S.  Then  the  sum  (at 
least  over  the  bodies  in  S)  may  be  approximated  by: 

Em’  _  Ms 
||X-X«||  “  ||x-Xcm|l 

■  ^  Qg  ~  Xcm)«(x  —  Xcm.)j  ... 

■^2  ||x-Xc„.||S 

where  Ms  is  the  total  mass  in  5,  Xcm  is  its  center-of-mass 
and  Qy  is  the  reduced  quadrupole  moment  of  S  about  it’s 
center  of  mass.  Indeed,  these  are  the  first  three  terms  of  an 
expansion  which  describes  the  subset  of  points  in  terms  of 
their  mass,  center-of-mass,  quadrupole  moments,  octopole 


moments,  etc.  In  general,  die  second  term  in  the  expan¬ 
sion  contains  the  dipole  moment  of  S,  but  since  the  dipole 
moment  vanishes  about  its  center-of-mass,  the  dipole  con¬ 
tribution  to  die  force-law  vanishes  by  construction.  The 
first  term  on  the  right-hand-side  of  Eq.  4  qiproximates 
die  contents  of  5  as  a  point  source  at  its  carter  of  mass. 
Subsequent  terms  correct  for  the  shape  of  the  matter  dis¬ 
tribution  within  S.  In  general,  we  can  keep  as  many  terms 
as  desired  in  the  expansion  of  die  multipole  moments  of 
S.  We  designate  the  highest  retained  order  as  p,  so  p  =  1 
for  die  monopole  approximation  (since  the  dipole  vanishes 
exacdy,  we  may  as  well  say  that  we’ve  “retained”  it),  and 
p  =  2  for  the  quadrupole  approximation. 

Eq.  4  represents  a  major  simplification,  since  its  eval¬ 
uation  requires  a  fixed  amount  of  time,  regardless  of  how 
many  points  are  in  S.  It  is  analogous  to  the  observation 
that,  if  one  wishes  to  know  the  force  exerted  on  an  apple 
by  the  10^  atoms  in  the  Earth,  one  can  approximate 
the  Earth  as  a  point  mass  located  at  its  center-of-mass,  and 
compute  the  force  in  a  very  small  number  of  operations.  It 
is  by  systematic  application  of  Eq.  4  (or  something  math¬ 
ematically  equivalent)  that  all  the  fast  N-body  methods 
reduce  the  overall  complexity  from  O(iV^)  to  something 
much  more  tractable.  Eq.  4,  of  course,  presupposes  diat 
the  aggregate  data,  i.e.,  Ms,  Xem.  etc.,  are  known,  so  any 
method  which  uses  it  must  (a)  compute  the  aggregate  data 
efficiendy  and  (b)  use  each  aggregate  datum  many  times 
to  amortize  the  cost  of  computing  it. 

We  must  also  remember  that  Eq.  4  is  an  approximation. 
It  must  be  used  with  care  as  it  can  introduce  excessive  errors 
into  a  calculation  if  used  inappropriately.  We  have  shown 
[38]  that  the  error  introduced  by  using  an  approximation 
like  Eq.  4  can  be  bounded  as  follows: 

eiiv^(x)ii  < 

’'(<«'+ -If +'>^) 

where  d  =  |x  -  Xcm|.  b  =  max,e5  ||x»  -  Xem||.  'fhe 
moments  Bn  are  defined  as 

=  ^  lix’  -  Xcmirilrn’ll  (6) 

Equation  6  is  essentially  a  precise  statement  of  the  fiict  that 
the  multipole  approximation  is  more  accurate  when: 

•  The  distance  to  the  measurement  point  is  large  (large 
d). 

•  The  sources  are  scattered  over  a  small  region  (small 
b). 
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•  High  order  approximations  are  used  (large  p). 

•  The  multipoles  which  have  been  neglected  are  small 
(small  £p+i). 

Note  that  the  multipole  series  does  not  converge  at  all 
for  d  <  b,  so  this  constitutes  a  precise  statement  of  the 
condition  that  the  point  fj  be  “well-separated"  from  the 
points  in  S.  Furthermore,  notice  that  increasing  the  order 
of  the  multipole  approximation  employed  is  just  one  way  to 
improve  the  approximation.  It  is  an  open  (atKl  extremely 
interesting)  question  whether  high-order  approximations 
are  cost-effective,  or  whether  it  is  simply  better  to  restrict 
use  of  the  approximation  to  larger  values  of  d. 

2.2  Data  Structures 

The  approximation  in  Eq.  4  is  only  part  of  the  story.  A 
mechanism  must  be  found  to  identify  candidate  subsets  of 
bodies,  compute  their  aggregate  parameters  (mass,  center- 
of-mass,  etc.),  and  selectively  apply  the  multipole  approxi- 
maticHi  to  the  evaluation  of  accelerations.  Several  different 
data  structures  have  been  prqxised  including  binary  trees 
(2,  7],  2'*-trees  [3]  (where  d  is  the  dimensionality  of  the 
underlying  space)  and  multigrids  of  fixed  depth  [20, 48]. 
The  tree  structures  are  inherently  adaptive,  which  allows 
them  to  efficiently  model  systems  which  contain  large  den¬ 
sity  contrasts,  while  the  multi-grid  structures  may  enjoy 
scHne  performance  advantages  because  the  fundamental 
objects  are  typically  multi-dimensional  arrays  accessed  in 
regular  and  predictable  patterns,  i.e.,  patterns  suitable  for 
efficient  execution  on  vector  and  super-scalar  processors. 
We  have  elected  to  work  with  adaptive  oct-trees  in  three  di- 
mensicxis  because  the  astrophysical  problems  that  bred  our 
initial  interest  in  the  subject  are  subject  to  extremely  large, 
dynamic  density  contrasts,  making  adaptivity  a  necessary 
feature  of  any  viable  method.  Some  of  die  other  problem 
domains  where  tree-methods  show  promise  may  not  re¬ 
quire  the  adaptivity  inherent  in  our  code.  It  remains  to  be 
seen  whether  the  overheads  associated  with  adaptivity  are 
significant  vis-a-vis  a  highly  optimized  non-adaptive  code. 

An  oct-tree  is  a  partition  of  three-dimensional  space 
into  cubical  volumes.  Each  cube  has  up  to  eight  daugh¬ 
ters,  obtained  by  splitting  tte  cube  in  half  in  each  of  the 
three  cartesian  directions.  Clearly,  in  d  dimensions,  the 
tree  branches  up  to  2'^  ways  at  each  level.  Quad-trees  are 
far  easier  to  illustrate  on  paper  than  oct-trees,  so  we  use 
them  in  the  figures.  A  quad-tree  with  2(X)0  centrally  con¬ 
centrated  bodies  is  shown  in  Figure  1.  Notice  that  the  tree 


^We  usually  refer  to  quad-trees  or  oct-trees  in  two  and  three  dimensions 


is  adaptive  so  that  in  the  center  where  there  is  a  high  density 
of  bodies,  the  tree  is  mote  finely  resolved.  The  adaptivity 
is  achieved  by  building  die  quad-tree  so  that  terminal  cells 
contain  exactly  one  body  [3].  Thus,  die  depth  of  the  tree  is 
approximately  logarithmic  in  the  local  density  of  particles. 


Figure  1:  A  quad-tree  with  2(X)0  centrally  concentrated 
bodies.  The  tree  is  more  refined  in  regions  of  higher  par¬ 
ticle  density  because  of  the  rule  that  a  terminal  cell  can 
contain  only  one  body. 


Each  cell  in  an  oct-tiee  has  topological  properties  (par¬ 
ents,  daughters,  siblings),  geometrical  properties  (spatial 
coordinates,  size),  and  numerical  properties  (mass,  center- 
of-mass,  quadrupole  moment,  etc.).  How  to  represent 
these  properties  in  a  program  presents  the  programmer 
with  several  choices.  The  topological  aspects  of  the  tree 
can  obviously  be  captured  (at  least  on  a  uniprocessor)  with 
pointers.  Complications  arise  in  parallel  with  regard  to  the 
meaning  of  off-processor  pointers  (for  distributed  memory 
systems),  or  synchronization  (for  shared  memory  systems), 
but  these  problems  can  be  overcome  [36].  [42]  has  demon¬ 
strated  large-scale  parallelism  when  the  data  needed  by 
each  processor  can  be  identified  in  advance.  This  method 
has  been  used  in  a  number  of  astrophysical  simulations 
[37. 41]  which  used  a  much  less  rigorous  error  bound  than 
Eq.6. 
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Unfortunately,  Eq.  6  does  not  admit  a  pre-deteimination 
of  which  data  will  be  required  by  each  cell.  This  is  prob¬ 
lematical  for  the  methods  of  [36],  so  a  new  approach  to 
parallelism,  based  on  accessing  data  on  demand  has  been 
implemented.  The  new  method  is  built  chi  the  idea  of 
assigning  to  every  possible  cell  in  the  tree  a  unique,  multi¬ 
word  key.  A  64-bit  key  allows  one  to  identify  every  cell 
in  a  oct-tree  with  21  levels.  This  has  proved  adequate  for 
highly  clustered  simulations  with  up  to  10^  bodies.  The 
set  of  all  possible  keys  is  clearly  much  larger  than  the  set 
of  keys  that  will  be  present  in  any  given  simulation.  This 
state  of  affairs  suggests  a  hash-table  as  an  appropriate  data 
structure  for  storing  the  tree  data. 

Figure  2  shows  schematically  how  the  keys  designate 
unique  cells  in  the  tree.  The  root  has  key  1.  The  daughters 
of  any  node  are  obtained  by  a  binary  left-shift  of  the  par¬ 
ent’s  key  by  d,  and  then  setting  the  low  bits  to  a  number 
between  0  and  2*'  -  1  to  distinguish  amongst  the  siblings. 
The  parent  of  a  cell  is  obtained  by  right-shift  of  its  key  by 
d.  The  bodies  in  the  simulation  can  be  assigned  unique 
keys  as  well,  simply  by  finding  the  key  corresponding  to 
the  smallest  possible  cell  that  contains  the  given  body. 


Figure  2:  A  four  level  quad-tree,  expanded  to  show  the 
relationship  between  parent  cells  and  daughters.  The  key 
values  of  each  cell  are  shown  in  binary.  Daughter  keys 
are  obtained  from  parents  by  a  two-bit  left-shift,  followed 
by  binary  OR  of  a  daughter-number  in  the  range  00-11 
(binary). 


2.3  Control  Stnictures 

We  have  shown  (at  least  schematically)  how  our  tree  will 
be  represented  in  memory.  We  now  turn  to  how  we  will  use 
the  data  to  evaluate  gravitaticMud  forces.  For  this,  we  must 
traverse  the  tree  data  structure  accumulating  acceptable 
interactions  as  w:  proceed.  The  traversal  is  governed  by  a 
Multipole  Accq)tability  Oiterion  (MAC)  which  identifies 
when  Eq.  4  is  sufficiently  accurate  to  be  used. 

Many  options  are  availaUe  for  the  MAC  [38].  In  this 
paper,  we  elect  to  bound  the  error  introduced  by  each 
multipole  approximation.  Thus,  we  can  use  Eq.  6  directly 
and  allow  ooly  interactions  for  which  the  right-hand-side 
of  Eq.  6  is  less  than  some  prescribed  tolerance. 

To  use  Eq.  6,  we  simply  carry  out  a  top-down  traversal 
of  the  tree  independently  for  each  body,  terminating  the 
descent  whenever  Eq.  6  is  satisfied  by  a  cell.  In  those 
cases,  Eq.  4  is  evaluated  to  compute  die  influence  of  tiie 
contents  of  the  cell  on  the  body.  If  we  reach  a  terminal 
node  of  the  tree  before  Eq.  6  is  satisfied  tiien  individual 
body-body  interactions  are  computed.  This  strategy  forms 
the  basis  of  our  vortex  dynamics  code  (see  Section  3). 

For  the  gravitational  problem,  we  have  improved  on 
this  strategy,  by  noting  that  nearby  particles  feel  almost 
the  same  i^uence  frcnn  distant  objects.  To  extend  our 
“apple”  analogy  further,  we  are  now  taking  note  of  the  fact 
that  two  neighboring  apples  feel  almost  the  same  gravi¬ 
tational  field  from  tiie  multitude  of  elementary  particles 
constituting  the  Earth.  Thus,  if  one  wishes  to  know  the 
gravitational  acceleration  of  two  apples  in  the  same  tree,  it 
may  be  sufficient  to  evaluate  Eq.  4  only  once  and  use  the 
same  answer  for  both.  In  fact,  this  analogy  again  only  cap¬ 
tures  the  first  (constant)  term  in  a  series  expansion.  This 
time,  the  expansion  is  a  Taylor  series  fcH-  Eq.  4  around 
the  point  x.  This  observation  does  not  lead  to  the  dramatic 
speedup  resulting  from  the  initial  observation  that  the  Earth 
itself  may  be  approximated  by  a  point  mass.  Nevertheless, 
a  significant  reduction  in  the  number  of  interactions  is  pos¬ 
sible.  In  order  to  pursue  this  line  of  reasoning,  we  need 
an  error  bound  analogous  to  Eq.  6.  A  detailed  analysis  is 
in  preparation  for  publication  elsewhere.  We  state  here, 
without  proof,  that  the  combined  error  resulting  from  both 
the  dipole  approximation  and  the  linear  term  of  a  Taylor 
series  expansion  of  Eq.  6,  at  a  distance  A  from  x  is  bounded 
by: 

^1  1  2B3\  ,,, 

eiiv^(x)ii  <  (771^  7  d?  ■ 

where 

S(2)  =  B2  -(■  2AB\  -f  A^Bq  (8) 
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Use  of  Taylor  series  to  reduce  the  number  of  evaluations 
of  Eq.  4,  introduces  a  new  class  of  interactimis  into  the 
problem.  Formerly,  Eq.  4  was  all  that  was  required  to 
evaluate  the  interactions  between  bodies  and  the  aggregate 
data  stored  in  cells.  Now  it  is  also  necessary  to  evaluate 
interactions  between  source  cells,  i.e.,  the  aggregate  data 
describing  the  field  sources,  mass,  center-of-mass,  etc.,  and 
sink  cells,  i.e.,  the  coefficients  of  a  Taylor  series  expansion 
around  some  arbitrary  point.  In  addition,  we  must  conpute 
translations  of  the  origin  of  a  Taylor  series  expansion  from 
the  center  of  a  parent  sink  cell  to  the  centers  of  its  daughters. 

We  can  compute  the  field  at  every  body  position  by  loop¬ 
ing  over  all  of  the  body  positions  in  order  of  increasing  key. 
For  each  body,  we  find  the  nearest  common  ancestor  with 
the  previous  body.  Any  Taylor  expansions  above  that  com¬ 
mon  ancestor  remain  valid  for  the  new  body,  while  those 
in  the  intervening  levels  must  be  computed  by  translat¬ 
ing  the  Taylor  expansion  of  the  parent,  and  accumulating 
any  cell-cell  interactions  that  satisfy  the  extended  cell-cell 
MAC  (based  on  Eq.  7).  It  is  crucial  for  bodies  to  be  sorted 
in  hash-key  order  because  that  guarantees  a  high  corre¬ 
lation  between  the  spatial  positions  of  successive  bodies, 
and  hence  only  a  small  number  (one  on  average)  of  Taylor 
expansions  will  need  to  be  reccanputed  for  each  new  body. 
Thus,  the  use  of  the  Taylor  series  expansion  reduces  the 
asymptotic  order  of  the  method  to  0\N),  but  this  means 
little  unless  the  constants  of  proportionality  are  also  com¬ 
pared.  Preliminary  results  indicate  that  the  total  number  of 
floating  point  operations  is  about  a  factor  of  6  lower  when 
allowing  cell-cell  interactions  and  using  Eq.  7.  compared 
with  use  of  Eq.  6,  in  a  one  million  body  simulation. 

2.4  Parallelism 

Two  basic  issues  arise  in  parallelizing  the  many  sci¬ 
entific  algorithms  for  distributeu  memory  machines:  de- 
ctxnposition  and  data  acquisition.  Decomposition  refers 
to  the  strategy  used  to  partition  the  problem  amongst  the 
available  processors.  Data  acquisition  refers  to  the  need 
to  communicate  between  processors  so  that  they  have  the 
data  they  need  to  carry  out  the  necessary  computations 
on  their  subset  of  the  data.  Decomposition  is  concerned 
with  load  balance,  i.e.,  arranging  that  all  processors  finish 
approximately  at  the  same  time,  and  with  minimizing  the 
volume  of  communication.  Data  acquisition  is  often  con¬ 
cerned  with  minimizing  or  hiding  the  large  latency  associ¬ 
ated  with  every  interprocessor  message.  This  is  achieved 
by  overlapping  communication  and  computation  when  the 
hardware  supports  it,  and/or  by  buffering  messages  into 
large  blocks  which  can  be  sent  with  only  a  single  latency 
overhead. 


We  have  already  seen  that  it  is  desirable  to  sort  the  bod¬ 
ies  in  our  simulation  into  a  sequence  based  on  the  hash- 
uble  key.  We  return  to  the  concept  of  a  key-sorted  list 
of  bodies  to  describe  our  parallel  decomposition.  Imagine 
that  the  bodies  are  already  sorted  in  increasing  hash-key 
order.  We  simply  break  the  list  into  segments  and  assign 
each  setment  to  a  processor.  That  is,  a  processor  is  respon¬ 
sible  for  computing  forces  on  a  group  of  bodies  that  fmms 
a  contiguous  segment  of  the  sorted  list  of  bodies.  Fig¬ 
ure  3  shows  a  path  which  traces  out  values  of  increasing 
key  in  a  quad-tree  with  depth  duee.  The  decmnposition  is 
achieved  by  cutting  the  curve  into  P  (number  of  proces¬ 
sors)  segments  of  equal  cost. 


Figure  3:  A  curve  that  traces  increasing  values  of  the 
bash-fiinction  key  in  a  quad-tree  of  depth  three.  The  de- 
compositicHi  is  obtained  by  locating  particles  on  this  curve 
(more  precisely,  on  the  analogous  curve  that  fills  the  low¬ 
est  level  of  the  tree),  and  assigning  contiguous  segments 
to  processors. 


By  dividing  the  path  into  equal-cost  pieces,  we  allow 
for  the  possibility  that  some  bodies  are  more  expensive 
than  others.  The  cost  of  each  body  is  a  measure  of  how 
much  cpu  time  is  required  to  compute  the  forces  on  it. 
In  a  highly  clustered,  adaptive  tree,  the  ratio  of  the  most 
expensive  to  least  expensive  body  can  be  as  high  as  20, 
so  it  is  imponant  to  balance  actual  cost,  rather  than  just 


particle  number.  We  determine  the  dec(Hnposition-cost  of 
each  body  empirically  by  recording  how  many  interactions 
it  required  <mi  the  previous  timestep.  On  the  first  timestep, 
we  assign  every  b^y  equal  cost  This  can  lead  to  consid¬ 
erable  load  imbalance  on  the  first  timestep,  but  the  cost  is 
negligible  over  the  course  of  a  simulation  which  typically 
requires  hundreds  or  thousands  of  timesteps. 

This  decomposition  has  the  advantage  that  it  is  can  be 
generated  with  a  general-purpose  parallel  sort  routine,  and 
that  it  leads  to  spatial  locality  in  the  decomposition.  The 
latter  property  reduces  the  amount  of  interprocessor  com- 
murucation  traffic,  as  spatially  nearby  bodies  tend  to  re¬ 
quire  the  same  off-processor  information.  In  contrast  to  the 
decomposition  us^  in  [36],  boundaries  between  proces¬ 
sors  correspond  to  divisions  within  the  tree  data  structure. 
Notice  also  that  from  one  timestep  to  the  next,  the  bod¬ 
ies  do  not  move  significantly,  so  that  on  every  timestep 
but  the  first,  the  sorting  subroutine  is  presented  with  an 
almost-sorted  input. 

The  second  issue  in  parallelism  is  the  acquisition  of 
off-processor  data.  One  strategy,  discussed  in  [36],  is 
to  acquire  all  the  necessary  off-processor  data  in  a  com- 
muiucation  phase  prior  to  the  computation.  This  has  the 
advantage  of  leaving  the  main  loop  of  the  computation 
essentially  untouched,  allowing  for  re-use  of  higUy  opti¬ 
mized  sequential  code.  This  technique  also  requires  that  it 
be  possible  a  priori  to  determine  which  data  will  be  needed 
by  which  processors.  Unfortunately,  when  Eq.  6  is  used 
for  die  MAC,  it  is  not  possible  to  pre-determine  which 
data  will  be  necessary.  Thus,  we  adopt  a  new  strategy  of 
acquiring  off-processor  data  on  demand. 

This  new  scheme  requires  a  small  modification  of  the 
hash-table  data  structure  used  to  represent  the  tree.  In  par¬ 
allel,  tile  hash  table  lookup  functions  can  return  in  three 
possible  ways:  they  can  find  the  requested  item,  they  can 
report  that  tiie  requested  item  does  not  exist,  or  they  can 
rqxHt  that  the  requested  iton  exists  in  another  processor’s 
memory.  In  the  last  case,  it  is  left  up  to  the  caller  how  to 
proceed.  The  simplest  approach  would  be  to  simply  ex¬ 
ecute  a  remote  procedure  call  to  retrieve  the  remote  data. 
This  would  be  unacceptably  slow,  as  it  would  entail  sev¬ 
eral  times  the  the  full  message  latency  for  every  remote 
data  access.  An  alternative  is  to  enqueue  a  request  to  be 
dispatched  at  a  later  time  (after  many  more  requests  have 
been  enqueued),  and  to  proceed  with  other  branches  of  the 
tree,  or  with  other  particles.  This  greatly  reduces  the  total 
number  of  messages  sent,  and  hence  the  overhead  due  to 
message  latency.  In  addition,  tiiis  approach  offers  the  pos¬ 
sibility  of  almost  totally  overlapping  communication  with 
calculation,  because  if  the  hardware  supports  simultane¬ 


ous  cranmunication  and  calculatitm,  the  code  can  easily 
take  advantage  of  it.  On  the  other  hand,  it  is  necessary  to 
make  substantial  changes  to  the  critical  inner  loop  of  tiie 
tree  traversal  routine,  to  allow  for  deferred  processing  of 
off-processor  cells,  and  simultaneous  processing  of  several 
bodies  [43]. 

2.5  Performance 

We  report  here  some  results  obtained  with  the  gravi¬ 
tational  code  on  the  Intel  Touchstone  Delta  system.  Al¬ 
though  the  Delta  has  S76  processors  (both  80386  and  i860), 
a  maximum  of  512  of  the  40Mhz  i860s  may  be  assigned 
to  a  single  job  at  any  one  time  (the  others  ate  respcmsible 
for  various  system  tasks  like  managing  peripheral  devices, 
user  logins,  etc.).  Each  processor  has  16Mb  of  memory  and 
is  connected  to  four  neighbras  through  a  two-dimensional 
mesh  routing  network.  Interprocessor  communication  is 
accomplished  by  explicit  system  calls  embedded  in  the  ap¬ 
plication’s  C  or  FORTRAN  code.  There  is  no  hardware 
support  for  shared  memory.  The  code  is  written  entirely 
in  ANSI  C.  and  has  been  ported  to  several  other  paral¬ 
lel  platforms,  including  the  CM-5.  the  SP-1,  workstation 
networks,  the  Intel  Paragon  and  the  Ncube-2.  Notably, 
the  code  also  compiles  and  runs  with  no  modifications 
whatsoever  on  sequential  platforms  tiiat  supply  an  ANSI 
C  compiler. 

We  report  results  that  use  a  monopole  approximation  for 
the  force  law,  and  a  first-order  Taylor  series  for  translation. 
Thus,  the  first  neglected  error  term  is  always  inversely  pro¬ 
portional  to  the  square  of  the  distance  between  the  source 
and  the  observation  point.  Errors  in  the  acceleration  are 
analytically  bounded,  using  Eq.  7,  to  be  less  than  2%  of 
the  force  at  the  edge  of  the  spterical  domain.  In  practice, 
the  errors  were  typically  much  less  than  2%. 

The  particle  distribution  is  taken  fitxn  a  series  of  sim¬ 
ulations  designed  to  simulate  die  formatitm  of  large-scale 
structure  in  the  universe’.  We  chose  a  n^resentative  distri¬ 
bution  from  late  in  the  evolution  which  displays  significant 
clustering.  The  density  contrast  between  die  most  dense 
and  least  dense  regions  in  the  simulation  is  ^proximately 
10^.  The  particles  repre^nt  the  so-called  “Dark  Matter” 
which  is  believed  to  dominate  the  mass  of  the  universe.  The 
region  simulated  is  a  sphere  lOMpc  in  diameter,  containing 
1099135  bodies  Each  body  has  a  mass  of  approximately 
3.3  X  10’  solar  masses,  which  is  appreciably  less  than  the 
mass  of  a  typical  galaxy  (10"  -10*’  solar  masses).  To  gen¬ 
erate  data  sets  with  fewer  bodies  we  simply  chose  bodies 
at  random  from  the  full  data  set.  Uliile  this  procedure 


^Iiis  limeslep  S40  of  model  128m.n-lb  in  [41] 
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may  be  questicxuble  irmn  a  physical  point  of  view,  it  en¬ 
sures  that  die  scaling  behavior  as  a  funcdoo  of  N  will  not 
be  ccmtaminated  by  dififerences  arising  from  different  spa¬ 
tial  distributions  of  bodies.  Figure  4  shows  approximately 
87000  of  the  bodies  in  our  data  set. 


Hgure  4:  Points  are  located  at  the  (x,y)  projection  of  the 
three-dimensional  positions  of  the  bodies  in  our  example 
N-body  problem.  The  figure  shows  8699S  bodies  chosen 
at  random  from  die  full  data  set.  The  region  shown  is 
lOMpc  across. 


ngure  5  shows  timings  for  data  sets  with  N  = 
5000,  10000,  20000,  50000,  100000,  200000,  500000 
and  1099135  bodies  on  1,2,4,. . .  ,512  processor  subsets 
of  the  Delta.  The  times  rqxirted  are  the  total  time  per 
timestqi  (force  evaluation,  velocity  update,  position  up¬ 
date),  averaged  over  2  to  5  timesteps  (more  for  less  time- 
consuming  runs).  We  do  not  count  the  first  time  step  be¬ 
cause  it  suffers  from  an  anomalous  load  imbalance  which 
is  corrected  in  later  steps  after  the  “cost"  of  each  particle 
is  known.  The  variation  between  timesteps  was  typically 
a  few  percent,  except  for  the  first,  which  typically  took 
about  twice  as  Icmg  as  the  others.  The  machine  was  not 
dedicated  during  these  benchmarks,  so  variations  may  be 


due  to  the  activity  of  other  users  or  inherent  variaticMis  from 
one  timestep  to  the  next 


Figure  5:  Running  time,  in  seconds  for  each  iteration  of 
the  gravitational  N-body  problem.  Hmes  are  reported  for 
processor  number.  P  =  1,2,...  ,512.  They  are  ccnnputed 
by  averaging  2-5  succesive  iterations.  In  all  cases,  the 
initial  timestep  is  not  included  because  it  suffers  from  load 
imbalance  and  I/O  startup. 


The  fact  that  the  curves  are  parallel  and  well-separated 
for  large  N  indicates  that  adding  processors  really  does 
reduce  the  execution  time.  In  order  to  remove  die  dominant 
trends  in  the  data,  we  plot  in  Figure  6  the  quantity  ^ 
against  N.  In  this  figure,  “perfect”  parallelism  wo^d  bo 
represented  by  all  curves  lying  on  top  of  the  extrapolation 
of  the  P  =  1  curve.  It  is  clear  that  the  P  =  512  curve  is 
still  approaching  the  odiers,  i.e.,  we  have  not  yet  reached 
the  large-iV  limit,  even  with  over  one  million  bodies.  In 
addition.  Figure  6  shows  that  the  scaling  of  T  with  N 
is  slightly  super-linear  for  the  range  of  N  studied.  In 
fact,  it  is  close  to  T  a  but  it  is  very  difficult  to 
distinguish  between  slightly  different  functional  forms  of 
the  asymptotic  scaling  behavior  with  only  four  P  =  1  data 
points. 

We  have  also  run  a  series  of  benchmarks  with  a  uniform, 
spherical  distribution  of  points  (not  shown).  In  Figure  7, 
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we  plot  ^  against  N  up  to  10  millioa  unifonnly  distrib¬ 
uted  bodies,  but  only  up  to  256  processois  (the  full  machine 
was  not  available  before  press  time).  The  force  evaiuaticms 
are  3-S  times  fasterfor  the  uniform  distribution  of  particles, 
compared  widi  the  highly  clustered,  evolved  astn^hysical 
data.  In  additicui,  the  N-dependence  appears  to  be  very 
close  to  T  oc  JV.  A  detailed  analysis  of  the  dependoice 
of  the  time-per-timestep  on  the  particle  distributicm  will  be 
presented  elsewhere. 


Hgure  6:  The  same  data  as  Figure  5,  but  the  abscissa  is 
The  vertical  distance  from  the  (extrapolation  of  the) 
P  =  1  curve  to  the  other  curves  is  a  good  indicator  of 
parallel  overhead.  In  addition,  these  curves  indicate  that 
the  time  per  timestep  scales  slightly  super-linearly  with 
N.  It  is  clear  drat  parallel  overhead  has  still  not  rearmed  its 
“large-N  limit”  with  1  million  bodies  on  512  processors. 


Figure  7:  Timing  data  for  a  uniform,  spherical  distribution 
of  bodies  plotted  with  an  abscissa  of  The  vertical 
distance  from  die  (extrapolation  of  the)  P  =  1  curve  to  die 
other  curves  is  a  good  indicator  of  parallel  overhead.  For 
uniform  data,  the  N  dependence  is  very  close  to  linear,  and 
parallel  overhead  is  less  than  50%  for  lOmillion  bodies  cm 
256  processors. 


3  The  Vortex  Particle  N-Body  Problem 

The  vorticity  equation  (w  =  V  x  u,  and  hence  V  •  a;  = 
0)  for  an  incompressible  fluid  (V  ■  u  =  0)  is  obtained  from 
taking  the  curl  of  the  momentum  equation: 


where  ^  =  If  +  (u  •  V)  /  is  the  Lagrangian  derivative 
and  1/  is  the  Idiomatic  viscosity.  'Fhe  vorticity  equation 
is  thus  a  nonlinear  transport  equation  which  can  be  solved 
using  a  particle  method.  In  tl^  regularized  version  of  the 
method.  [32.  11.  12.  34.  35.  27.  5.  6.  29.  1.  28.  4.  9.  14. 
10.  13.  44.  45.  46.  47]  the  particle  representation  of  the 
vorticity  held  is  taken  as: 

=  5^  Ca(x-X«(#))  (w*(t)vol’)  (10) 

4 

=  X)Ux-x*(0)7»(<)- 
1 

where  C<r  is  a  radially  symmetric  regularization  function 
and  0-  is  a  smoothing  radius  (i.e..  a  core  size):  (^(x)  = 

irC  (^)  with  the  normalization  ((p)  dp  = 

1  .  Notice  that  does  not  constitute  a  generally 
divergence-free  "basis". 

The  velocity  held  is  computed  from  the  particle  repre¬ 
sentation  of  the  vorticity  field  as  the  curl  of  a  vector  stream- 
function.  u.r  =  (hence  V  •  u.,  =  0).  The  vector 

streamfunction  thus  satisfies:  V^^,(x,f)  =  -<5a.(x,t). 

Defining  G»^(x)  =  with  G{p)  such  that 

=  =  «i) 

one  obtains 

^,(x,t)  =  5;G,(x-x«(f))7n*)  (12) 

« 

u,(x,t)  =  j;;  (VG,(x-x«(f)))  X7«(t),(13) 
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The  Gaussian  smoothing  is  used: 


47rC(p)  = 

(14) 

47r  G(p)  = 

(15) 

It  leads  to  a  second  order  numerical  method,  provided  the 
core-overlapping  condition  remains  satisfied  (i.e.,  f  >  1 
where  h  is  a  typical  spacing  in  between  particles).  We 
approximate  erf(^)  «  1  for  p  >  4,  i.e.,  for  l|x||  >  4o-, 
we  have  47r(?ff(x)  = 

In  the  inviscid  case,  the  evolution  equations  for  the 
particle  position  and  strength  vector  are  taken  as 

^x«(t)  =  u„.(x«(t),t) ,  (16) 

=  (Vu.(x*(<),t)).7’(f).  (17) 


The  evaluation  of  the  velocity  field  induced  by  a  system 
of  vortex  particles  is  thus  very  sin  Jlar  to  the  evaluation  of 
the  acceleration  field  induced  by  a  system  of  point  masses 
in  gravitation.  Eq.  13  is  structurally  almost  identical  to 
Eq.  3.  One  need  only  replace  die  scalar  particle  mass  m< 
with  the  vector  vorticity  strength.  7’,  and  the  real  valued 
multiplication  with  a  vector  cross-product  Just  as  with  the 
gravitational  case,  it  is  possible  to  approximate  the  summa¬ 
tion  in  Eq.  13  with  an  expression  similar  to  Eq.  4  involving 
the  multipole  moments  of  the  vorticity  distribution.  The 
assymptotic  behavior  of  the  Green’s  fiunction,  is  even 
the  same,  which  allows  us  to  re-use  much  of  the  machinery 
involving  eiror  bounds  from  [38].  A  vortex  particle  code 
is,  however,  more  costly  than  a  gravitational  code:  ( 1 )  The 
particle  strength  and  potential  fields  are  vectors  rather  than 
scalars,  so  the  potential  evaluatitm  is  immediately  three 
times  as  cosdy.  (2)  Gne  must  evaluate  both  the  first  and 
second  derivatives  of  the  vector  streamfunction  in  order  to 
obtain  both  the  velocity  vector  and  the  velocity  gradient 
tensor,  which  appears  in  the  right-hand  side  of  17. 

Other  difficulties  arise  from  the  interpretaticm  of  the 
vortex  particles  as  representing  a  continuum.  Hence,  one 
needs  to  use  a  smoothing  functicm  that  leads  to  conver¬ 
gence.  e.g.,  the  Gaussian  smoothing,  which  is  consider- 
aUy  more  cosdy  than  the  Plummer  smoothing  used  in  the 
gravitational  code.  In  vortex  simulations,  one  also  needs 
to  ensure  that  the  smoothed  vortex  particles  continue  to 
overlap  for  the  duration  of  the  simulatitm.  Hence  particle 
"redistribution"  (from  a  deformed  set  of  particles  onto  a 
new  set  of  regularly  spaced  particles)  nuy  become  nec¬ 
essary  in  long  time  computations  [25,  26].  Finally,  the 
particle  field.  Ci>^,  is  not  guaranteed  to  remain  a  good  rep¬ 
resentation  of  the  divergence-free  field,  =  V  x  u,. 
for  all  times.  Hence,  in  long  time  computations,  one  may 
need  to  "relax"  the  particle  weights  so  that  G>„  remains  a 
good  representation  of  w,  [44. 47, 30].  These  considera¬ 
tions  apply  to  any  vortex  method  and  are  not  significandy 
affected,  one  way  or  the  other,  by  use  of  a  tree  code  to 
carry  out  the  field  summations. 

When  using  multipole  expansions  up  top  =  2,  (mono¬ 
pole  +  dipole  +  quadrupole),  estimates  for  the  error  on  tp 
and  u  =  V  X  ^  at  an  evaluation  point  x,  a  distance  d  from 
the  center  Xe  of  a  multipole  expansion  are  obtained  as: 


eiiuii  < 


1  1  B3 


1  6  B2 
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,  1  1  rsa,  B|| 

.  (19) 

-  d*(l-*)M  d’  Bod^l 

wherexc.b  =  ||x«  -  x,  Bn  and  are  box  properties: 

Bo  =  ElM’ 

q 

(20) 

=  ^Ilx«-x.|ni7’|l. 

q 

(21) 

We  choose 

EJIt’II  x» 

Xc=  * 

Bo 

(22) 

the  centroid  of  the  absolute  value  of  the  particle  strengths, 
as  the  center  of  our  multipole  expansion.  This  choice  ana¬ 
lytically  minunizes  B2,  and  hence  minimizes  the  bound  on 
the  error  introduced  by  the  multipole  expansion.  Notice 
that  the  dipole  term  of  this  multipole  expansion  does  not 
vanish.  The  dipole  moment  vanishes  only  in  the  gravita¬ 
tional  case  because  the  masses  are  positive  definite  scalars, 
and  the  centroid  of  the  absolute  value  of  the  masses  is  iden¬ 
tical  with  their  center  of  mass.  The  same  non-vanishing  of 
the  dipole  contribution  will  occur  in  electrostatics  where 
each  particle  has  a  scalar  electric  charge  which  may  be  of 
either  sign.  The  dipole  moment  of  a  collection  of  such 
charges  does  not  vanish  when  die  expansion  is  centered  at 
the  centroid  of  the  charge  absolute  values. 

We  carried  out  a  series  of  timings  for  a  problem  rep¬ 
resenting  the  evolution  of  an  initially  spherical  vorticity 
distribution.  Figure  8  shows  the  initial  positions  of  vortex 
particles  representing  a  surface  vorticity  of  thickness,  h, 
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fA  =  hu=  —  sin{e)%^.  (23) 

The  problem  was  discretized  by  recursively  splitting  the 
faces  of  an  icosahedron,  and  then  projecting  them  onto  a 
unit  sphere.  The  strength  of  each  vortex  parti  cle  was  taken 
as  7  =  fxA,  where  A  is  die  area  of  the  projected  triangle, 
and  /I  is  the  value  of  Eq.  23  at  its  centroid.  The  thickness 
was  chosen  to  be  equal  to  the  length  of  die  sides  of  the 
triangles  before  projection.  By  terminating  the  recursive 
splitting  at  different  levels,  we  obtained  the  discretizations 
shown  in  Table  1.  The  core-size  was  taken  a  =  h. 
Figure  9  shows  the  81920  vortex  system  evolved  to  f  = 
2.S0  (after  1 00  timesteps).  Note  that  although  we  depict  the 
particles  as  points,  they  actually  carry  a  three-component 
strength  vector  7.  Representing  the  particles  as  vectors 
merely  clutters  the  figure.  The  simulaticms  were  done 
using  a  “sum”  error  tolerance  (see  [38])  based  on  Eq.  19, 
which  guarantees  that  the  total  error  in  the  velocity  is  less 
than  0.001. 


■I  0  1 


Figure  8:  The  positions  of  81920  vortex  particles  initially 
on  the  surface  of  a  unit  sphere.  Each  of  the  particles 
carries  a  vector  vorticity,  fi  =  hu  =  which  is 

not  shown  for  clarity.  This  simulation  is  slightly  different 
from  the  the  ones  for  which  timings  are  present^  because 
it  uses  a  core-size  (r  —  0.05. 
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Figure  9:  The  vortex  simulation  of  Figure  8,  evolved 
through  100  timesteps  to  t=2.5. 
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Timings  are  shown  in  Figure  10  for  various  P  = 
1,2,4,... S 12,  and  for  the  discretizations  shown  in  Ta¬ 
ble  1.  The  timings  correspond  to  the  total  wallclock  time 
per  iteration,  i.e.,  the  time  spent  in  parallel  decomposition, 
computing  the  vector  stream  function  V*.  the  velocity  u, 
and  the  gradient  of  u  at  each  particle  position,  as  well  as 
updating  the  particle  positions  and  strengths  according  to 
Eq.  16,  and  17.  As  with  the  gravitational  code,  we  report 
an  average  over  2-S  iterations,  and  we  do  not  include  the 
first  timestep  because  of  its  anomalous  parallel  load  imbal¬ 
ance.  Comparistm  with  the  gravitational  timings  confirms 
the  claim  that  the  vortex  code  is  somewhat  more  costly 
than  the  gravitational  problem 


Figure  10:  lime  per  timestep  for  the  vortex  method  vs. 
number  of  vortices,  averaged  over  several  timesteps. 


The  same  data  is  replotted  with  an  abscissa  of  ^  in 
Figure  1 1,  to  better  illustrate  the  dominant  trends.  Here,  it 
is  clear  that  one  millitm  bodies  clearly  is  in  the  “large-N” 
limit,  and  the  parallel  overhead  (obtained  by  measuring  the 
difference  between  the  P  =  512  curve  and  the  extrapola¬ 
tion  of  the  P  =  1  curve)  is  in  the  neighborhood  of  20%. 
Thus,  although  the  vortex  simulation  is  overall,  somewhat 
slower  than  the  gravitational  simulation,  it  makes  more  ef¬ 
ficient  use  of  the  parallel  hardware  (a  fact  that  is  of  small 
consolation  to  the  user  with  a  limited  computational  bud¬ 
get).  Figure  1 1  also  demonstrates  that  the  scaling  behavior 


level 

h 

N 

4 

0.0657 

5120 

5 

0.0329 

20480 

6 

0.0164 

81920 

7 

0.00821 

327680 

8 

0.00411 

1310720 

Tablel:  Parameters  for  the  series  ofvortex  particle  method 
simulations. 

with  N  is  again  slightly  super-linear.  This  time  the  expo¬ 
nent  is  approximately  T  a 


Figure  11:  The  same  data  as  Figure  10,  but  plotted  wifii 
an  abscissa  of  ^ .  The  parallel  overhead  can  be  estimated 
by  measuring  the  d'lference  between  a  P  5^  1  curve  and 
the  (extrapolation  of  the)  P  =  1  curve.  The  large  N 
dependence  is  again  slightly  super-linear  over  the  range  of 
N  studied. 


4  Conclusion 

The  two  methods  described  here  demonstrate  that  tree 
codes  are  versatile  and  scale  well  on  large  parallel  com¬ 
puters.  Despite  the  complexity  of  the  algorithms  involved, 
we  have  been  able  to  use  essentially  the  same  code  to  solve 
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problems  in  vortex  dynamics  and  astrophysics.  The  code, 
including  all  auxiliary  software  for  such  tasks  as  random 
number  generation,  fast  approximate  square-root,  a  prim¬ 
itive  debugging  facility,  flexible  timers  and  counters,  etc., 
as  well  as  the  essential  elements:  parallel  quicksort,  multi¬ 
word  key  and  hash-table  manipulation,  tree  construction, 
traversal  and  field  evaluation,  consists  of  about  9000  lines 
of  C  source  code  and  header  files.  Of  this,  about  7000  lines 
are  in  a  common  library  which  is  completely  shared  by  the 
two  applicaticHis  (in  fact,  much  of  the  library  is  not  even 
specific  to  tree-codes).  Of  the  remaining  2(X)0  lines,  about 
700  are  identical  in  the  two  applications.  We  are  pursuing 
further  abstractions  which  will  allow  us  to  separate  these 
into  the  conunon  library  :.s  well. 

It  is  important  to  note  that  the  programs  and  libraries  de¬ 
scribed  here  are  portable  to  other  parallel  supercomputers. 
We  have  run  the  code  on  Intel  Paragon  and  iPSCV860  sys¬ 
tems,  an  IBM  SP-1,  a  CM-J,  an  Ncube-2  and  on  networks 
of  workstations,  as  well  as  the  “degenerate”  parallel  case 
of  a  single  uniprocessor  workstation.  All  that  is  required 
to  support  a  new  system  is  the  creation  of  an  appropriate 
“Makefile”  to  define  such  things  as  the  C-compiler,  linker, 
etc.,  and  the  creation  of  a  single  system-dependent  C  source 
file.  The  system  dependent  file  maps  a  very  small  number 
of  primitive  OS  requests,  e.g.,  what  processor  number  am 
I,  into  system  calls  appropriate  for  the  ta.  jt  architecture. 
The  delta  version  of  this  file  is  400  lines  in  length  and  con¬ 
sists  primarily  of  name-translations  from,  e.g.,  our  internal 
name  Procnum  ( )  to  the  OS-specific  name  myproc  ( ) . 
Porting  this  file  to  a  new  platform  typically  requires  a  few 
hours,  most  of  which  is  spent  searching  the  target  system’s 
manuals  fcff  relevant  system  calls. 

Portability  and  versatility  are  extremely  important  at¬ 
tributes  for  the  parallel  tree  code.  We  are  actively  pursu¬ 
ing  other  application  areas  where  tree  codes  show  great 
promise.  The  electrostatic  interaction  in  molecular  dy¬ 
namics  is  long-range  and  has  the  same  functional  form  as 
Newtonian  gravity.  In  some  circumstances  it  can  domi¬ 
nate  the  time  required  to  carry  out  a  simulation.  In  fact 
the  perceived  cost  of  calculating  electrostatic  interactions 
may  influence  the  choice  of  problems  or  approximations 
which  are  studied.  Several  authors  have  used  tree  codes 
to  address  this  problem,  but,  to  our  knowledge,  none  have 
used  parallel  architectures  and  none  have  used  error  bound 
criteria  like  Eq.  6. 

Integral  equations  in  potential  theory  were  historically 
the  first  application  of  tree  codes  [33].  Related  problems 
give  rise  to  the  largest  problems  currently  addressed  with 
0{N^)  dense  linear  solvers  [18].  We  are  in  the  process 
of  completing  a  tree  code  implementation  of  a  boundary 


integral  equation  solver  which  will  be  incorporated  into 
our  vortex  dynamics  code  to  account  for  the  presence  of 
bluff  bodies  in  the  flow.  The  resulting  program  will  use 
tree  code  codes  in  two  separate  contexts:  to  cotnpute  the 
interactions  between  vonices,  and  to  compute  the  interac¬ 
tions  between  surface  panels  as  pan  of  an  iterative  solver 
to  satisfy  the  boundary  conditions  on  the  surfaces.  The 
interactions  ate  different  in  the  two  contexts,  as  are  the 
spatial  distributitxi  of  sources  and  the  error  criteria,  but  the 
parallel  tree  code  library  provides  much  of  the  necessary 
machinery  for  both  contexts.  Preliminary  results  indicate 
1  million-panel  systems  can  be  solved  on  the  Delta  in  a  few 
minutes.  Such  problems  would  be  completely  intractable 
for  traditional  solvers. 
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